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We present a submatrix update algorithm for the continuous-time auxiliary field method that 
allows the simulation of large lattice and impurity problems. The algorithm takes optimal advantage 
of modern CPU architectures by consistently using matrix instead of vector operations, resulting in a 
speedup of a factor of « 8 and thereby allowing access to larger systems and lower temperature. We 
illustrate the power of our algorithm at the example of a cluster dynamical mean field simulation of 
the Neel transition in the three-dimensional Hubbard model, where we show momentum dependent 
self-energies for clusters with up to 100 sites. 



PACS numbers: 71.27. +a,02.70.Tt,71.10.Fd 

The theoretical investigation of correlated fermionic 
lattice systems has been one of the most challenging tasks 
in condensed matter physics. Many of these systems are 
not tractable with controlled analytic approximations in 
the regimes of interest, so that numerical simulations 
need to be employed. Several numerical approaches exist: 
With exact diagonalization 1 - (ED) one calculates the ex- 
act eigenstates of a system on a small lattice. Because the 
Hilbert space grows exponentially with lattice size, ED 
is limited to comparatively small systems. Variational 
methods like the density matrix rcnormalization group 
theory^ 3 - (DMRG) work well in one dimension, but ex- 
tensions to two-dimensional systems^— are still under de- 
velopment. Standard lattice Monte Carlo methods^ are 
hampered by the fermionic sign problem^^ that limits 
access to large system size or low temperature away from 
half filling. 

Systems with a large coordination number are often 
studied within the dynamical mean field approximation 
(DMFT) 1 ^. Early studies by Metzner and Vollhardl^ 
and Muller-Hartmanni 4 - showed that the diagrammatics 
of interacting fermions becomes purely local in the limit 
of infinite coordination number. In this case the solution 
of the lattice model may be obtained from the solution 
of an impurity model and an appropriately chosen self- 
consistency condition 1 ^. 

Later work on cluster extensions of DMFT— ~— took 
into account non-local correlations in addition to the lo- 
cal correlations already contained within the DMFT by 
considering "cluster" impurity models with an internal 
momentum structure 2 ^. These cluster approximations 
are based on a self-energy expansion in momentum space, 
iS) « J2k" ^K{^>)(t>K{k^r^ that becomes exact in the 
limit of a complete momentum space basis (N c — > oo) and 
can therefore be controlled by increasing the cluster size. 

Quantum impurity models are well suited to nu- 
merical study. Methods for their solution include 



numerical renormalization group approaches^ 2 -, exact 
diagonalization^ 3 -, and approximate semi-analytical re- 
summation of classes of diagram a 15 ' 24 ' 25 . However, un- 
til a few years ago only the Hirsch-Fye quantum Monte 
Carlrj 2 ^ algorithm was able to obtain unbiased and nu- 
merically exact solutions of large cluster impurity prob- 
lems at intermediate interaction strength. This changed 
with the development of continuous-time methods 2 ^ - — . 
The vastly better scaling^ 3 , of these methods and the ab- 
sence of discretization errors allowed access to lower tem- 
peratures, larger interactions, and more orbitals. 

Large cluster calculations remain computationally 
challenging as the numerical cost - even in the absence 
of a sign problem - scales as 0[(N C /3U) 3 ] in the case 
of the interaction expansio n 28 ! 31 , and 0[exp(N c )/3 3 ] in 
the hybridization expansion 3 ^ methods (for single orbital 
cluster Anderson models at inverse temperature (3 and 
interaction U for a cluster of size N c ). It is therefore 
important to develop efficient algorithms to solve cluster 
impurity models. 

Two numerical algorithmic improvements have signif- 
icantly increased the size of systems accessible by sim- 
ulations with the Hirsch-Fye algorithm: the "delayed" 
updates 3 ^, and the "submatrix" updates 3 ^. An impor- 
tant question is therefore if these techniques may be gen- 
eralized to the continuous-time algorithms and whether 
similar savings in computer time may be expected, and 
how these savings translate into newly accessible physics. 

Both "delayed" and "submatrix" updates are mainly 
based on efficient memory management; "submatrix" up- 
dates further reduce the algorithmic complexity of the 
updating procedure. Modern computer architectures cm- 
ploy a memory hierarchy: Calculations are performed on 
data loaded into registers. Any data that are not in the 
registers are stored either in the "cache" (currently with 
a size of a few MB) or in the "main memory" (with a 
size of a few GB). The cache is relatively fast, but there 
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is little of it, while access to the main memory is often 
slow and shared among several compute cores. The bot- 
tleneck in many modern scientific applications, including 
the continuous-time algorithms, is not the speed at which 
computations are performed, but the speed at which data 
can be loaded from and stored into main memory. 

The central object in continuous-time algorithms is a 
matrix, which for large cluster calculations does not fit 
into the cache. Monte Carlo updates often consist of 
rank-one updates or matrix-vector products. Such up- 
dates perform 0(m 2 ) operations on 0(m 2 ) data, where m 
is the average matrix size, and therefore run at the speed 
of memory. Matrix- matrix operations [with 0(m 3 ) oper- 
ations executed on 0(m 2 ) data] could run at the speed 
of the registers, as more (fast) calculation per (slow) load 
/ store operation are performed. The reason behind the 
success of both the "submatrix" and the "delayed" up- 
dates is the combination of several (slow) successive rank- 
one operations into one fast matrix-matrix operation, at 
the cost of some minimal overhead. This is illustrated in 
Fig. HI 




FIG. 1. (Color online) Illustration of update formulas. 1(a) 
"rank-one" updates of Ref. |3l|, accessing 0(m 2 ) data points 
for 0(m 2 ) operations and performing one update. 1(b) 



submatrix updates, accessing 0(m 2 ) values but performing 
0(rn 2 k) operations, for k updates. 

The delayed update algorithm can be straightfor- 
wardly generalized to (non-ergodic) spin-flip operations 
in the interaction expansion (CT-INT) and continuous- 
time auxiliary field algorithms (CT-AUX)2&, and an 
adaptation of the concept of delayed updates to vertex 
insertion and removals in the interaction expansion was 
recently proposed by Mikelsons^I. 

In this article we present a generalization of the "sub- 
matrix" technique of Ref. HH to the CT-AUX algorithm, 
which uses fewer redundant operations than "delayed" 
updates. We find a speed increase of ~ 8 for a typical 
large cluster impurity problem. We demonstrate the scal- 
ing both as a function of computational resources and as 
a function of problem size, and we show results for con- 
trolled large-scale cluster calculations. 



The paper is structured as follows: In Sec. U we rein- 
troduce the CT-AUX algorithm and describe the Monte 
Carlo random walk procedure. In Sec. [TT] we introduce 
the submatrix updates, and in Sec. IHII we apply them to 
CT-AUX. Section IIVI shows physics and benchmarking 
results for the new algorithm, and Sec. [V] contains the 
conclusions. 



I. THE CONTINUOUS-TIME AUXILIARY 
FIELD ALGORITHM 

We present the submatrix updates for CT-AUX^L, for 
which the linear algebra is similar to the well-known 
Hirsch Fye2£ method. To introduce notation and con- 
ventions we rep eat the important parts of the deriva- 
tion of Ref. l3l[ limiting ourselves to the description of 
the dynamical mean field solution of the single orbital 
Anderson impurity model. Lattice problems [i.e., prob- 
lems without hybridization terms in the Hamiltonian and 
with no (cluster) dynamical mean field self-consistency 
imposed] differ only in the form of the non-interacting 
Green's function. Their simulation proceeds along the 
same lines and will not be treated separately here. 



A. Partition Function Expansion 

The Hamiltonian of the single orbital Anderson im- 
purity model describes the behavior of an impurity (de- 
scribed by operators d a ,ai) with an on-site energy eo 
and on-site interaction U coupled by a hybridization with 
strength V pa to a bath (described by a pa ,a' pa ) with dis- 
persion e p : 

H = H + V, (1) 
H a = -(e Q -U/2)(n t + r H ) 

(2) 



<T,P 



a,p 



V = u 



(3) 



n a = d\d a denotes the impurity occupation. 
Continuous-time algorithms expand expressions for the 
partition function Z = Trexp(— j3H) (at inverse temper- 
ature /3) into a diagrammatic series. In CT-AUX, the 
series is a perturbation expansion in the interaction: 



n>0 



(l-^)...e-^>*(l-^)e^*o]. (4) 
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The interaction term V in this expansion can be decou- 
pled with an auxiliary field^ 



1 



/3V _ 1 



E 



7s(n t -n 4 ) 



s=-l,l 

cosh( 7 ) = 1 + (fiU)/(2K), 



(5a) 
(5b) 



introducing an arbitrary constant K and auxiliary 
"spins" s. Hence 

Z =E E / d H^) Z »> ( 6 ) 

~ Ki<n 



Z„({ Sl ,r 4 })=Tr[] e - A ^ ffo e 



Si7(«t-™4-) 



(7) 



Note that the insertion of an arbitrary number of "inter- 
action vertices" (auxiliary spin and time pairs) {sj,Tj) 
with Sj = into Eq. © docs not change the value of 
Z n {{si,Ti\). We will refer to auxiliary spins with value 
s n = as "non-interacting" spins. 

We can express the trace of exponentials of one-body 
operators in Eq. © as a determinant of a (n x n) matrix 
N, 



Z n {{Si,Ti}) 



N^({ Si , n }) 



J] detJV-HKrJ), (8) 

<r=t4 



-05?' e 



rt*i} 



- 1 



(9) 



diag(e^- 1 '" Sl ,...,e^- 1 ^ s "). (10) 

denotes a (n x n) matrix of bare Green's functions, 

(^Oa-'^)y = Goo-(Ti — Tj). From now on we will omit the 
spin index a. 

The matrix N is related to the Green's function matrix 
G by G = NGq. The matrices G and N for auxiliary spin 
configurations that have the same imaginary time loca- 
tion for all vertices, but differ in the value of an auxiliary 
spin s p , are related by a Dyson equation 

N-j = N l3 + (G ip - S ip )XN pj , (11a) 

Gjj = Gij + (Gi p — 5i P )XGpj, (Hb) 

A = e v »~ Vp — 1. (11c) 

This relation is the basis for spin-flip updates. 



To guarantee ergodicity of the sampling it is sufficient 
to insert and remove spins with a random orientation 
Si =t, 4- a t random times < Tj < j3. Spin insertion 
updates are balanced by removal updates. For an inser- 
tion update we select a random time in the interval [0, 0) 
and a random direction for this new spin, leading to a 
proposal probability pP ro P(n ->• n + 1) = (l/2)(dr//3). 
For removal updates a random spin is selected and pro- 
posed to be removed, leading to a proposal probability 
pPro P ( n+1 n ) = i/( n +l). The combination of Eq. ® 
with these proposal probabilities leads to the Metropolis 
acceptance rate p(n — > n + 1) = min(l, R) with 



R = 



n + KiL dettM-V 1 



(13) 



where (n) denotes the dimension of -/V" 1 . 

In addition to the insertion and removal updates 
we consider spin flips of auxiliary spins. These up- 
dates are self-balancing, and the transition probabil- 
ity from a state {(si,ti), • • • , (sj,T»), ■ • • } to a state 
{{sun), • • • , (-Si,Ti), • • • } is given by 



det[N^{{{ Sl ,n),--- .(-s^),---})]- 1 



r= n 

detlTVnU^n),---,^),---})]- 1 



(14) 



In the particle hole symmetric case the parameter K 
may be chosen such that only even orders in the perturba- 
tion series occur and that the average perturbation order 
is half as larg e as the one of the algorithm presented here 
(see Ref. [39| for details in the real-time context, where 
this scheme allowed propagation to much longer times). 
As the resulting algorithm is less general and requires 
double- vertex insertions it will not be explored here. 

Non-interacting auxiliary spins, or auxiliary spins with 
value 0, do not change the value of Z n in Eq. [7] We 
will make use of this fact to precompute a matrix that 
is equivalent to N but contains non-interacting vertices 
represented by spin auxiliary spins. Insertion and re- 
moval updates then become equivalent to spin-flip up- 
dates (from to 1 or —1 and vice versa), thus allowing 
for a similar application of the sub-matrix update algo- 
rithm as in the case of the Hirsch-Fye solver—. This 
procedure is explained in more detail in Sec. Mil 



B. Random Walk 

The infinite sum over expansion orders n and the inte- 
gral and sum over vertices {(sj,Ti)} in Eq. ((6]) is com- 
puted to all orders in a stochastic Monte Carlo pro- 
cess: The algorithm samples time ordered configurations 
{{ s i->n)} with weight 




II. SUBMATRIX UPDATES 

To derive the sub-matrix updates^ let us consider a 
typical step k of the algorithm at which the interaction 
[with spin and time {s Pk , T Pk ) of m interaction vertices] is 
changed from V Pk to V pk . The new matrix G k+1 is then 
given by Eq. (jllaj) . 

G^ +1 = G$j + {G k ipk - S iPk )X k Gp kj , (15) 
A* = e^- V **-1. 
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X k denotes the change of interaction at step k. 
We proceed by showing how the determinant ratio 
det N k / dot N k+1 of Eq. (JT3J) as well as the new matrix 
N k+1 arc computed efhciently using the Woodbury for- 
mula: We define an inverse matrix A of G, analyze its 
changes during an update, and show how they can be 
incorporated in a small (k x k) matrix T that is easily 
computed by accessing only k 2 <C m 2 matrix elements in 
each step. The inverse of this matrix is then iteratively 
computed cither by employing an LU decomposition, or 
a partitioning scheme. 

A change to the inverse Green's function matrix A k = 
(G*)- 1 is of the formic 



7 



A% + -i k A k J. 



(16) 



ip u P3 



-7<t0 



1. 



j k , similar to X k above, contains the information about 
the changed interaction at step k. Eq. (|16[) is commonly 
known as the Sherman Morrison formula and illustrated 



in Fig. |l(a)| We define A k } 



A%+l k A k p 5 PJ , i.e. the 



matrix A k where the p-th column is multiplied by (1 + 
7 fe ), and therefore det^' = (1 + j k ) det(A k ). We then 
rewrite Eq. fll6j) as A^ 1 = A^ - j k S ip 6 P j, and, using 



y y 

the "matrix determinant lemma" det(A y +UiVj) 
vi(A~ 1 )i q u q ] det Aij, we have 

detA fe+1 = dct(i fe )dct(l - 7 fc [(i fc )- 1 ]pp) 



[1 + 
(17) 



dctA fc (l+7 fe ) 1 



7 



det^ fc 7 



G 



1+7* 

7 



This formula yields the determinant ratio 



det N k 
dctiY fc+1 



= -7 



G, 



i + r 



(18) 



needed in Eq. (fl3| for the acceptance or rejection of an 
update. 

We can recursively apply Eq. (|17l) to obtain an expres- 
sion for performing multiple interaction changes, as long 
as they occur for different spins Pi ^ pj(i ^ j): 



A 



k+l 



A; 



Ak 



)S P 



V k 



A k 



y^7 {Ai Pl 

1=0 
k 

1=0 

X k (Y k ) T , 



(19) 
(20) 
(21) 



The new matrix A k+1 is therefore generated from A by 
successively multiplying columns pi, < I < k of A with 



j l and adding constants to the diagonal. X and Y T 
are index matrices that label the changed spins and keep 
track of a prefactor j k . 

For measurements we need access to the Green's func- 
tion G, not its inverse A. It is obtained after /c max steps 
by applying the Woodbury formula Eq. (f2"2"j) to Eq. (TT9"]) : 
with q denoting a Woodbury step combining /c max vertex 
update steps: 

G q+l = (A q+l )- 1 

= A- 1 +A- 1 X(l-Y T A- 1 xy 1 Y T A-\ (22) 
G q+1 = G + GX(l — Y T GX)~ 1 Y T G, (23) 

where G = A" 1 . After some simplification, Eq. (f23| can 
be shown to be 



^1/ — D i 1 {G i3 - G ipk T p ^ pi G Pl j) . (24) 
max x fc max - matrix T, defined 



Here we have introduced a k 

as 



G»(p,q)-S p 



Jp 



Jp 



(25) 



and a vector D that is 1 everywhere but at positions 
where auxiliary spins are changed: 



1 



1 



(26) 



Note that G° is the interacting Green's function at step 
k = and not the bare Green's function Q° of the effec- 
tive action, unless all auxiliary spins are zero. 

Translating this Green's function formalism to a for- 
malism for N matrices is straightforward: writing G = 
NQ° and multiplying Eq. from the right with (G )' 1 
yields 



N q+1 
y 



{Nij Gi Pk T pkP[ N Pl j) : 



(27) 



where one Gi Pk remains in Eq. (|27j) . This equation is 



illustrated in Fig. 1(b) 



Inserting G = NG Q into Eq. (|TTaj) and setting V = 
(N' = 1) we obtain: 



7Ve v - iVG e v 



iVG 



(iVG ) l: 
AT,:. 



G^d- 



-%)/(e 



v, 



Gi 



(28) 
(29) 
(30) 



The computation of G from ./V in this manner fails if the 
interaction Vj is zero. In this case we need to compute 
Gij = NikG^j at a cost of O(N) for each i and j. 



A. Determinant Ratios and Inverse Matrices 

To either accept or reject a configuration change, we 
need to compute the determinant ratio det N k+1 / det N k 
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[Eq. (p])]. Following Ref. [H we write: 



2. Inversion by Partitioning 



det A 



k+l 



(-1) 



1 JJij-det A°dctr fc . (31) 



The computation of the determinant det T is an expen- 
sive 0(k 3 ) operation, if T k has to be recomputed from 
scratch. However, we successively build T k by adding 
rows and columns. In the following we present two ef- 
ficient (and as far as we could see equivalent) methods 
to iteratively compute determinant ratios of T: keeping 
track of an LU decomposition, and storing the inverse 
computed using inversion by partitioning. 



1. LU decomposition 

For each accepted update we keep track of a LU de- 
composition of F: 



w 



jk-X 

o p 



Ly = s, 
U T x = w, 



f3 = G°(p k ,p k ) 



1 + 7* 



T 

x y 



(32) 

(33) 
(34) 

(35) 



where both x T and y are computed in 0(k 2 ) by solving a 
linear equation for a triangular matrix. The determinant 
ratio needed for the acceptance of an update is 



det A k+1 
det A k 



(36) 



These updates have been formulated for spins that 
have only been updated once. In the case where the 
same spin is changed twice or more, rows and columns in 
r, or L and U, need to be modified. These changes are 
of the form T — » T + uv T , and Bennett's algorithm^ can 
be used to re-factorize the matrix. 

The probability to accept/reject a (k + l)-th spin re- 
quires 0(k 2 ) operations [computation of x and y us- 
ing Eqs. (|33|) and (|34)) requires 0(k 2 ) operations, while 
Eq. (|3"5|) requires O(k) operations]. On the other hand, 
the "delayed" algorithm requires 0(km) operations to 
compute the acceptance rate of a (k + l)-th spin flip, for 
a matrix of size to. In this sense, the submatrix update 
methodology not only manages matrix operations effi- 
ciently, but also improves the computational efficiency of 
the spin-flip acceptance rate. 



Alternatively, we can compute the inverse of T by em- 
ploying the Sherman-Morrison formula: 



p = (d- 



w 



+ (r,7 1 s )r> T r\7 1 ) -r^s/r 



fe+i 



det r fe+1 
detr fe 



det A k+1 
det A k 



-r/3. 



(37) 



(38) 



(39) 



Although both methods obtain the acceptance rates of 
Eqs. (|36p and (|39|) in 0(k 2 ) steps, inversion by parti- 
tioning requires an additional step of updating the r^*, lf 
and hence is expected to be slower than the LU de- 
composition approach. However, the complication of re- 
orthogonalizing the LU factorized matrix using Bennett's 
algorithm docs not arise. 



III. THE RANDOM WALK WITH SUBMATRIX 
UPDATES 

The sums and integrals of Eq. (|6]) are computed by a 
random walk in the space of all expansion orders, auxil- 
iary spins, and time indices. In the cluster case, configu- 
rations acquire an additional site index. A configuration 
Cfc at expansion order n contains n interaction vertices 
with spins, sites, and time indices: 



{(ri, Sl,(7l), • • • (T n , S n , <7„)}. 



(40) 



The configuration space C consists of all integrands / 
summands in Eq. ((6]), which we can represent by sets of 
triplets of numbers, consisting of auxiliary spins, times, 
and site indices: 

C = {co,-- - ,Cfc((ri,Si,o-i),-- - ,(T k ,s k ,a k )),- ■ ■}. (41) 

To efficiently make use of the submatrix updates, we 
add an additional step before insertion and removal up- 
dates are performed. In this preparation step, we insert a 
number fc max of randomly chosen non-interacting vertices 
with auxiliary spin s = 0, which, as discussed in Sec. U 
does not change the value of the partition function. Once 
these vertices are inserted, insertion and removal updates 
at the locations of the pre-inserted non-interacting ver- 
tices become identical to spin-flip updates: an insertion 
update of a spin s = 1 now corresponds to a spin-flip 
update from spin s = to spin s = 1, and similar for re- 
moval updates. This pre-insertion step of non- interacting 
vertices then allows for a similar application of submatrix 
updates as in the case of the Hirsch-Fye algorithm. 

To accommodate this pre-insertion step, we split our 
random walk into an inner and an outer loop. In the 
outer loop (labeled by q) we perform measurements of 
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observables and run the preparation step discussed above 
as well as recompute steps. These steps are described in 
more detail below. In the inner loop (labeled by k) we 
perform k max insertion, removal, or spin- flip updates at 
the locations of the pre-inserted non-interacting spins. 
It is best to choose (m) 3> fc max > 1 so the blocking 
becomes efficient, but matrices of linear size fc max are 
small enough to fit into the cache. 



A. Preparation steps 

We begin a Monte Carlo sweep with preliminary com- 
putations for spins that we will propose to insert or re- 
move. For this, we generate randomly a set of £;™ ax pairs 
of (site, time) indices, where fc™ x denotes the maximum 
insertions possible. We then compute the additional rows 
of the matrix N for these noninteracting spins: 



N 



N 
R 1 



(42) 



where R is a matrix of size n x /c™| x containing the con- 
tributions of newly added noninteracting spins, 



R, 



(43) 



at the cost of 0(n 2 k^ K ), as well as the Green's function 
matrix G = NQ° for the new spins (cost n 2 k? 



umax\ 



B. Insertion, removal, spinflip of an auxiliary spins 

Vertex insertion updates are performed by proposing 
to flip one of the newly inserted non-interacting spins 
from value zero to either plus or minus one. The deter- 
minant ratio is obtained by using Eqs. ([22]) . ([24"]) , ([2B1) . 
and (|35[) . (i.e., by the solution of a linear equation of a 
triangular matrix). If the update is accepted the auxil- 
iary spin is changed and the matrix T is enlarged by a 
row and a column. 

Starting from a configuration Ck = 
{(n,Si,ai),---(rk,Sk,(Xk)} we propose to remove 
the interaction vertex (tj , s j , cry) . The ratio of the two 
determinants [Eq. ([55]) ] is computed by proposing to flip 
an auxiliary spin from ±1 to zero. For this we compute 
s and w as in Eq. ([25]) . and then compute x and y 
by solving a linear equation for a triangular system 
[Eqs. ([32]) and ([21])]. Finally, Eq. (|2H]) is computed using 
Eq. (|35|) . If the update is accepted the auxiliary spin is 
set to zero and T is enlarged by a row and a column. 

Double vertex updates required for the scheme of 
Ref. [3^ proceed along the same lines and enlarge T by 
two rows and two columns. 

To perform a spin-flip update we choose a currently 
interacting spin with value ±1 and propose to flip it to =Fl 
using Eqs. ([25]) . ([2^]) . and (]25)) . If the update is accepted, 
r grows by a row and a column. 



C. Recompute step 

This scheme of insertion, removal, and spinflip updates 
is repeated /c max times. With each accepted move the 
matrix T grows by a row and a column. To keep the al- 
gorithm efficient we periodically recompute the full ./V 
matrix using the Woodbury formula 1 2 71 



(44) 



as r grows with every accepted update, and the cost of 
computing determinant ratios is 0(k 2 ). The recompute 
step consists of two inversions for L and U, which are 
both 0(k 2 ) operations, and two matrix multiplications, 
at cost 0(k 2 N) and 0(N 2 k) respectively. Noninteract- 
ing auxiliary spins can then be removed from Nf^~ by 
deleting the corresponding rows and columns. 



D. Measurements 

At the end of a sweep, if the system is thermalizcd, 
observable averages are computed. As the complete N- 
matrix is known at this point, the formulas presented in 
Ref. [31] are employed without change. In most calcula- 
tions, the computation of the Green's function is the most 
expensive part of the measurement. In large "dynami- 
cal cluster approximation" (DCA) 16 ' 18 ' 20 calculations it 
is therefore advantageous to compute directly the Green's 
functions in cluster momenta, of which there are only N c , 
in contrast to the N 2 real-space Green's functions. Also, 
on large clusters, Green's functions arc best measured 
directly in Matsubara frequencies. 



IV. RESULTS 

We present two types of results. First we examine 
the performance of submatrix updates in practice, using 
several scaling metrics. We then illustrate a physics ap- 
plication where we test the DCA approximation on large 
clusters, showing cluster size dependence and extrapola- 
tions to the infinite system limit. 



A. Scaling of the algorithm 

Two types of scaling are commonly analyzed in high 
performance computing: the so-called "weak" scaling, 
which defines how the the time to solution varies when 
the resources arc increased commensurately with the 
problem size, and the "strong" scaling, which is defined 
as how the time to solution decreases with an increasing 
amount of resources for fixed problem size. 

We begin by analyzing the scaling of the time to so- 
lution for fixed resources but varying problem size. As 
"problem size" we consider the average expansion order 
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FIG. 2. Time per update (in arbitrary units) for submatrix 
and rank one CT-AUX updates. Open circles (black online): 
rank one updates. Filled diamonds, triangles, squares, and 
left triangles: submatrix updates for fc max = 32, 64, 128, and 
256. Dashed line: ideal 0(k 2 ) scaling, arbitrary prefactor. 



FIG. 4. Time to solution as a function of the number of 
CPUs, for a 16-site cluster impurity problem. Squares (black 
online): U/t = 8, /3t = 10 {(k) = 550), half filling. Circles 
(red online): fit = 20 ((k) = 1100). The dashed lines show 
the ideal scaling. 
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FIG. 3. Updates per time as a function of k n 
U/t — 8, for temperatures indicated. 



16-site cluster, 



or matrix size, (k). The average expansion order is re- 
lated to the potential energy and therefore extensive in 
cluster size. For systems with small average expansion 
orders (N < 200), the entire matrix fits into the cache, 
and therefore there is no advantage in using submatrix 
updates. With increasing average matrix size caching 
effects become more important. 

Figure [2] shows the strong scaling, as the time per up- 
date (in arbitrary units) as a function of the expansion or- 
der (matrix size), for rank-one updates and several fc max . 
The ideal scaling is 0(k 2 ) per update, or 0(k 3 ) for (k) up- 
dates needed to decorrelate a configuration^ The scaling 
per update is indicated by the dashed line. 

Submatrix updates are, for problems with expansion 
orders between 512 and 2048, about a factor of eight 
faster than straightforward rank one updates. 

For small expansion order CT-AUX with and with- 
out submatrix updates behave similarly. For expansion 
orders of 256 and larger, the speed increase from subma- 
trix updates becomes apparent, and at expansion orders 
of 512 and larger the difference with and without subma- 
trix updates corresponds to the difference of data transfer 
rates between the cache and CPU and the main memory 



and CPU, or the difference at which memory intensive 
(Sherman - Morrison-like vector operations) and CPU 
intensive (Woodbury-likc matrix operations) run. 

The optimal choice of the expansion parameter fc max 
for the test architecture lies somewhere between 64 and 
128 (performance is relatively insensitive to the exact 
choice of fc max ). This is also illustrated in Fig. [3j for 
a small choice of fc max the Woodbury matrix-matrix op- 
erations do not dominate the calculation and the algo- 
rithm is similar to CT-AUX, where much time is spent 
idling at memory bottlenecks. Caching effects get more 
advantageous for larger fc max , until for fc max > 128 most 
of the time is spent updating and inverting the T ma- 
trices. Note, however, that the optimal value of /c max is 
expected to depend on architectural details such as the 
size of the cache. 

In Fig. [4] we present a strong scaling curve by show- 
ing the time to solution (in seconds) for two problem sets 
(symbols), as well as the ideal scaling (dashed lines), as a 
function of the number of CPUs employed. This time in- 
cludes communications and thermalization overhead that 
does not scale with the number of processors. This is 
the part that according to Amdahl's law^ leads to less 
than ideal scaling behavior. CT-AUX has a remarkably 
small thermalization time and is therefore ideally suited 
for parallelization on large machines. As can be seen, 
for the chosen problem sizes, the algorithm can be scaled 
almost ideally to at least 10,000 CPUs. Note, however, 
that the scaling behavior is expected to depend critically 
on the number and type of measurements that are per- 
formed. This is because the measurements are perfectly 
parallel, since they are only performed once the calcula- 
tion is thermalized. Here, we have restricted the mea- 
surements to the single-particle Green's function. If, in 
addition, more complex quantities such as two-particle 
observables are measured, the simulation run-time will 
be dominated by the measurements and the ideal scaling 
behavior is expected to continue to much larger processor 
counts. 
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FIG. 5. Extrapolation of the cluster energy as a function of 
cluster size, at U/t = 8, T/t = 0.5. Dashed line: DMFT re- 
sults. Circles (black online) denote DCA results from clusters 
with size 18, 26, 36, 56, 64, 84, and 100. Solid line (green 
online): least squares fit. Triangle (blue online): extrapo- 
lated result. The error bar denotes the fitting error; statistical 
(Monte Carlo) errors are smaller than symbol size. 



B. Simulations of the 3D Hubbard model 

As an illustration of the power of the algorithm we 
present results from a calculation of the Neel temperature 
of the three-dimensional Hubbard model at half filling, 
within the DCA approximation, as a function of cluster 
size. 

A comprehensive study, showing DCA data at and 
away from half filling, for interaction strengths up to 
U ~ 12 and clusters of size < 64, will be published 
elsewhere^. The results we present here are for tem- 
perature T = t (far above Tn), for T = 0.5t, and for 
T = 0.354. The lowest temperature is close to the Neel 
temperature, and long ranged correlations cause a slow 
convergence. The results were obtained on 128 CPUs 
in one hour per iteration. In Fig. [5] we show the ex- 
trapolation of the energy for several cluster sizes and 
an extrapolation to the infinite cluster size limit. The 
plot shows that controlled extrapolations to the thermo- 
dynamic limit^r— can be obtained in practice. Monte 
Carlo errors are much smaller than the symbol size. 

Fig. [5] shows self-energy cuts along the main axes in re- 
ciprocal space. Plotted are results for single site DMFT 
and clusters of size 18, 84, and 100, interpolated using 
Akima splines. While momentum averaged quantities 
like the energy in Fig. [5] show clear convergence and the 
possibility for extrapolation, convergence is not uniform 
in all quantities. The high temperature self-energy plot- 
ted in panel[SJt is clearly converged as a function of cluster 
size, the intermediate temperature self-energy plotted in 
panel shows some cluster size dependence, and the 
right panel HJ: shows a self-energy that even for 100 clus- 
ter sites is not yet converged (a sign of the long wave- 
length physics important near TV). Reliable extrapola- 
tion of the cluster self-energy to the £(&,£<;) of the infi- 



nite system would require even larger clusters. Further 
insight can be gained from the frequency dependence of 
the Matsubara self-energy (Fig. [7]). Plotted is the fre- 
quency dependence at three points in the Brillouin zone. 
While a significant cluster size dependence is observed at 
low frequencies, the results converge to the local DMFT 
limit at high frequencies, as one would expect. 



V. CONCLUSIONS 

We have presented a variation of the CT-AUX algo- 
rithm that, while mathematically equivalent, arranges 
operations in such a manner that they are ideally suited 
for modern computational architectures. For large prob- 
lem sizes, this "submatrix" algorithm achieves a signifi- 
cant performance increase relative to the traditional CT- 
AUX algorithm, by replacing the slow rank-one updates 
by faster matrix-matrix operations. Our implementa- 
tion of the submatrix updates in the CT-AUX algorithm 
requires an additional preparation step in which non- 
interacting vertices with auxiliary spins s = are intro- 
duced. After this step, the CT-AUX vertex insertion and 
removal updates become equivalent to spin-flip updates. 
The submatrix algorithm then proceeds by manipulat- 
ing the inverse of the Green's function matrix, for which 
changes under auxiliary spin flips are completely local. 
This allows for a significantly faster computation of the 
QMC transition probabilities under a spin-flip update. 
The algorithm keeps track of a number k of these local 
changes, similar to the delayed update algorithm, and 
then performs a Green's function update as a matrix- 
matrix multiplication. 

Because this algorithm requires additional overhead 
over the traditional CT-AUX implementation, there is 
an optimal choice for the maximum number of spin-flip 
updates fc ma x per Green's function update which depends 
on problem size and architectural parameters such as the 
cache size. For the test architecture we have used, we 
have found that fc max w 128 for large problem sizes. For 
this optimal value, we find a speed increase up to a factor 
of 8 relative to the traditional CT-AUX algorithm. 

We have shown that simulations for large interacting 
systems, previously requiring access to high performance 
supercomputers, become feasible for small cluster archi- 
tectures, and we have demonstrated the scaling on su- 
percomputers that shows that, by using the submatrix 
algorithm, continuous-time quantum Monte Carlo meth- 
ods are almost ideally adapted to high performance ma- 
chines. 

As an example we have shown how some cluster dy- 
namical mean field theory quantities, like the energy, can 
be reliably extrapolated to the thermodynamic limit, and 
how for other quantities, like the self-energy, even large 
cluster calculations are not sufficient to obtain converged 
extrapolations. 

The algorithm is similarly suited to the solution of lat- 
tice problems [i.e., problems where V pa — and where 
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FIG. 6. Real and imaginary parts of the lowest Matsubara frequency of the interpolated DCA cluster self-energy T,(k,iujo) of a 
3D Hubbard model above the Neel temperature^, for U/t = 8, T/t = 1 (left panel), T/t = 0.5 (middle panel), and T/t = 0.35 
(right panel) , at half filling. The lines denote DMFT results (horizontal straight lines) and results for clusters of size 18, 84, 
and 100. The interpolation follows a path along the high-symmetry points F — (0,0,0), X = (tt, 0, 0), M = (7r,7r,0), and 

R = (-7T, 7T, 7r). 



no (cluster) dynamical mean field self-consistency is im- 
posed] . 

Our results arc also readily generalized to the interac- 
tion expansion formalism developed in Refs. [13 and [28l . 
offering the possibility to significantly accelerate simula- 
tions of multi-orbital systems. 
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